We investigate the detectability of modules in large networks when the number of modules is not
known in advance. We employ the minimum description length (MDL) principle, which seeks to
minimize the total amount of information required to describe the network, and avoid overfitting.
According to this criterion, we obtain general bounds on the detectability of any prescribed block
structure, given the number of nodes and edges in the sampled network. We also obtain that the
maximum number of detectable blocks scales as $\sqrt{N}$, where $N$ is the number of nodes in the
network, for a fixed average degree $\langle k \rangle$. We also show that the simplicity of the MDL
approach yields an efficient multilevel Monte Carlo inference algorithm with a complexity of
$O(\tau N \log N)$ if the number of blocks is unknown, and $O(\tau N)$ if it is known, where $\tau$
is the mixing time of the Markov chain. We illustrate the application of the method on a large
network of actors and films with over $10^6$ edges, and a dissortative, bipartite block structure.



The detection of modules, or communities, is one of the most intensely studied problems in the
recent literature of network systems [1, 2]. The use of generative models for this purpose, such as
the stochastic blockmodel family [3-21], has been gaining increasing attention. This approach
contrasts drastically with the majority of other methods thus far employed in the field (such as
modularity maximization [22]), since not only is it derived from first principles, but it is also
not restricted to purely assortative and undirected community structures.
However, most inference methods used to obtain the most likely blockmodel assume that the number
of communities is known in advance [14, 18, 23-26]. Unfortunately,
in most practical cases this quantity is completely unknown, and one would like to infer it from
the data as well. Here we explore a very efficient way of obtaining this information from the data,
known as the minimum description length (MDL) principle [27, 28], which states that the best model
for a given data set is the one which compresses it the most, i.e. the one which minimizes the total
amount of information required to describe it. This approach was introduced to the task of
blockmodel inference in Ref. [29]. Here we generalize it to accommodate an arbitrarily large number
of communities, and to obtain general bounds on the detectability of arbitrary community
structures. We also show that, according to this criterion, the maximum number of detectable blocks
scales as $\sqrt{N}$, where $N$ is the number of nodes in the network. Since the MDL approach
results in a simple penalty on the log-likelihood, we use it to implement an efficient multilevel
Monte Carlo algorithm with an overall complexity of $O(\tau N \log N)$, where $\tau$ is the average
mixing time of the Markov chain, which can be used to infer arbitrary block structures on very
large networks with an a priori unknown number of modules.

The model. The stochastic blockmodel ensemble is composed of graphs with $N$ nodes, each belonging
to one of $B$ blocks, and the number of edges between nodes of blocks $r$ and $s$ is given by the
matrix $e_{rs}$ (or twice that number if $r = s$). The degree-corrected variant further imposes
that each node $i$ has a degree given exactly by $k_i$, where the set $\{k_i\}$ is an additional
parameter set of the model [30]. The directed version of both models is analogously defined, with
$e_{rs}$ becoming asymmetric, and $\{k_i^-\}$ together with $\{k_i^+\}$ fixing the in- and
out-degrees of the nodes, respectively. These ensembles are characterized by their microcanonical
entropy $\mathcal{S} = \ln\Omega$, where $\Omega$ is the total number of network realizations [31].
The entropy can be computed analytically in both cases [32],

$\mathcal{S}_t \cong E - \frac{1}{2}\sum_{rs} e_{rs}\ln\left(\frac{e_{rs}}{n_r n_s}\right)$,   (1)

for the traditional blockmodel ensemble and,

$\mathcal{S}_c \cong -E - \sum_k N_k \ln k! - \frac{1}{2}\sum_{rs} e_{rs}\ln\left(\frac{e_{rs}}{e_r e_s}\right)$,   (2)

for the degree-corrected variant, where in both cases $E = \sum_{rs} e_{rs}/2$ is the total number
of edges, $n_r$ is the number of nodes which belong to block $r$, $N_k$ is the total number of
nodes with degree $k$, and $e_r = \sum_s e_{rs}$ is the number of half-edges incident on block $r$.
The directed case is analogous [35] (see Supplemental Material for an overview).
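As a small illustration, the sparse-limit entropies of the two ensembles can be evaluated directly from a block edge-count matrix. The sketch below is not the paper's implementation; the function names and the list-of-lists layout of `e` are assumptions made for this example, with `e[r][s]` holding the number of edges between blocks `r` and `s` (twice that number on the diagonal).

```python
import math

def entropy_traditional(e, n):
    """S_t = E - (1/2) sum_rs e_rs ln(e_rs / (n_r n_s)), sparse limit.

    e: symmetric block matrix (diagonal holds twice the edge count).
    n: list of block sizes n_r.
    """
    E = sum(map(sum, e)) / 2
    total = sum(e_rs * math.log(e_rs / (n[r] * n[s]))
                for r, row in enumerate(e)
                for s, e_rs in enumerate(row) if e_rs > 0)
    return E - total / 2

def entropy_degree_corrected(e, degrees):
    """S_c = -E - sum_i ln k_i! - (1/2) sum_rs e_rs ln(e_rs / (e_r e_s)).

    sum_k N_k ln k! is written here as a sum over nodes of ln k_i!.
    """
    E = sum(map(sum, e)) / 2
    e_r = [sum(row) for row in e]  # half-edges incident on each block
    total = sum(e_rs * math.log(e_rs / (e_r[r] * e_r[s]))
                for r, row in enumerate(e)
                for s, e_rs in enumerate(row) if e_rs > 0)
    log_factorials = sum(math.lgamma(k + 1) for k in degrees)
    return -E - log_factorials - total / 2

# A single block of 10 nodes versus two perfectly separated blocks of 5.
one_block = entropy_traditional([[40]], [10])
two_blocks = entropy_traditional([[20, 0], [0, 20]], [5, 5])
```

In this toy case the two-block description has the lower entropy, as expected when all edges fall inside the blocks.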

The detection problem consists in obtaining the block partition $\{b_i\}$ which is the most likely,
when given an unlabeled network $G$, where $b_i$ is the block label of node $i$. This is done by
maximizing the log-likelihood $\ln\mathcal{P}$ that the network $G$ is observed, given the model
compatible with a chosen block partition. Since we have simply $\mathcal{P} = 1/\Omega$, maximizing
$\ln\mathcal{P}$ is equivalent to minimizing the entropy $\mathcal{S}_{t/c}$, which is the language
we will use henceforth. Entropy minimization is well-defined, but only as long as the total number
of blocks $B$ is known beforehand. If this is not the case, the optimal value of
$\mathcal{S}_{t/c}$ becomes a strictly decreasing function of $B$. Thus, simply minimizing the
entropy will lead to the trivial $B = N$ partition with each node in its own block, and the block
matrix $e_{rs}$ becomes simply the adjacency matrix. A principled way of avoiding such overfitting
is to consider the total amount of information necessary to describe the data, which includes not
only the entropy of the fitted model, but also the information necessary to describe the model
itself. This quantity is called the description length of the data, and for the stochastic
blockmodel ensemble it is given by

$\Sigma_{t/c} = \mathcal{S}_{t/c} + \mathcal{L}_{t/c}$,   (3)

where $\mathcal{L}_{t/c}$ is the information necessary to describe the model via the $e_{rs}$
matrix and the block assignments $\{b_i\}$. The information-theoretic interpretation is
straightforward: the minimum value of $\Sigma_{t/c}$ is an upper bound on the total amount of
information necessary to describe a given network to an observer lacking any a priori information.
Therefore, the best model chosen is the one which best compresses the data, which amounts to an
implementation of Occam's Razor. For the specific problem at hand, it is easy to compute
$\mathcal{L}_{t/c}$. The $e_{rs}$ matrix can be viewed as the adjacency matrix of a multigraph with
$B$ nodes and $E$ edges, where the blocks are the nodes and self-loops are allowed. The total
number of $e_{rs}$ matrices is then simply $\left(\!\!\binom{B(B+1)/2}{E}\!\!\right)$, where
$\left(\!\!\binom{n}{m}\!\!\right) = \binom{n+m-1}{m}$ counts the combinations with repetition, and
the total number of block partitions is $B^N$. Assuming no prior information on what specific model
is more likely, we obtain $\mathcal{L}_t$ by multiplying these numbers and taking the logarithm,

$\mathcal{L}_t \cong E\, h\!\left(\frac{B(B+1)}{2E}\right) + N\ln B$,   (4)

where $h(x) = (1 + x)\ln(1 + x) - x\ln x$, and $E \gg 1$ was assumed. Note that Eq. 4 is not the
same as the expression derived in Ref. [29], which is obtained only by taking the limit
$E \gg B^2$, in which case we have $\mathcal{L}_t \approx \frac{B(B+1)}{2}\ln E + N\ln B$ [34]. We
do not take this limit a priori since, as we show below, numbers of blocks up to
$B_{\max} \sim \sqrt{E}$ can in principle be detected from empirical data. For the degree-corrected
variant, we still need to describe the degree sequence of the network, hence

$\mathcal{L}_c = \mathcal{L}_t - N\sum_k p_k \ln p_k$,   (5)

where $p_k$ is the fraction of nodes with degree $k$. Note that for the directed case we need
simply to replace $B(B+1)/2 \to B^2$ and $k \to (k^-, k^+)$ in the equations above.
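The model description length is cheap to evaluate numerically. The following is a minimal sketch for the undirected case, assuming the forms $\mathcal{L}_t = E\,h(B(B+1)/2E) + N\ln B$ and $\mathcal{L}_c = \mathcal{L}_t - N\sum_k p_k\ln p_k$; the function names are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

def h(x):
    """h(x) = (1 + x) ln(1 + x) - x ln x, for x > 0."""
    return (1 + x) * math.log(1 + x) - x * math.log(x)

def model_dl_traditional(B, N, E):
    """L_t: cost of the e_rs matrix plus the block labels of N nodes."""
    return E * h(B * (B + 1) / (2 * E)) + N * math.log(B)

def model_dl_degree_corrected(B, N, E, degrees):
    """L_c = L_t + N * (entropy of the empirical degree distribution)."""
    counts = Counter(degrees)
    degree_entropy = -sum(c / N * math.log(c / N) for c in counts.values())
    return model_dl_traditional(B, N, E) + N * degree_entropy

# The penalty grows with B for fixed N and E.
penalties = [model_dl_traditional(B, 1000, 5000) for B in (1, 2, 10, 50)]
```

The degree term vanishes when all nodes have the same degree, so in that case the two variants carry the same model cost.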

MDL bound on detectability. The difference
$\Sigma_b \equiv \Sigma_{t/c} - \Sigma_{t/c}|_{B=1}$ between the description length of a graph with
some prescribed block structure and that of a fully random graph with $B = 1$ can be written as

$\Sigma_b = E\, h\!\left(\frac{B(B+1)}{2E}\right) + N\ln B - E\,\mathcal{I}_{t/c}$,   (6)

where $\mathcal{I}_t = \sum_{rs} m_{rs}\ln(m_{rs}/w_r w_s)$ and
$\mathcal{I}_c = \sum_{rs} m_{rs}\ln(m_{rs}/m_r m_s)$, with $m_{rs} = e_{rs}/2E$,
$m_r = \sum_s m_{rs}$ and $w_r = n_r/N$ (with $B(B+1)/2 \to B^2$ for the directed case). We note
that $\mathcal{I}_{t/c} \in [0, \ln B]$. If for any given graph we have $\Sigma_b > 0$, the inferred
block structure will be discarded in favor of the simpler fully random $B = 1$ model. Therefore the
condition $\Sigma_b < 0$ yields a limit on the detectability of prescribed block structures
according to the MDL criterion. For the special case where $E \gg B^2$, this inequality translates
to a more convenient form,

$\langle k \rangle > \frac{2\ln B}{\mathcal{I}_{t/c}}$.   (7)

The directed case is analogous, with $2\ln B \to \ln B$ replaced in the equation above.

FIG. 1. (a) Prescribed block structure with $B = 10$ and $\mathcal{I}_t = \ln B/6$, together with
inferred parameters for different $\langle k \rangle$; (b) Description length per edge
$\Sigma_b/E$ as a function of $B$ for different $\langle k \rangle$, for networks sampled from (a);
(c) NMI between the true and inferred partitions, for the same networks as in (b); (d) NMI between
the true and inferred partitions as a function of $\langle k \rangle$, for different prescribed
block structures. The grey lines correspond to the threshold of Eq. 7. In all cases the networks
have $N = 10^4$ nodes.
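To make the criterion concrete, the sketch below evaluates the description length difference relative to the $B = 1$ model, assuming the undirected form $\Sigma_b = E\,h(B(B+1)/2E) + N\ln B - E\,\mathcal{I}_t$ with $\mathcal{I}_t = \sum_{rs} m_{rs}\ln(m_{rs}/w_r w_s)$. The helper names and the two-block example are illustrative assumptions, not values taken from the paper.

```python
import math

def h(x):
    return (1 + x) * math.log(1 + x) - x * math.log(x)

def information_traditional(m, w):
    """I_t = sum_rs m_rs ln(m_rs / (w_r w_s)), with m_rs = e_rs / 2E, w_r = n_r / N."""
    return sum(m_rs * math.log(m_rs / (w[r] * w[s]))
               for r, row in enumerate(m)
               for s, m_rs in enumerate(row) if m_rs > 0)

def sigma_b(B, N, E, info):
    """Description length of the block structure minus that of the B = 1 model."""
    return E * h(B * (B + 1) / (2 * E)) + N * math.log(B) - E * info

def min_mean_degree(B, info):
    """Large-E threshold: the structure is kept only if <k> exceeds this value."""
    return 2 * math.log(B) / info

# Two equal-sized blocks with 90% of the edges inside the blocks.
m = [[0.45, 0.05], [0.05, 0.45]]
w = [0.5, 0.5]
info = information_traditional(m, w)
dense = sigma_b(2, 10_000, 50_000, info)   # <k> = 10
sparse = sigma_b(2, 10_000, 10_000, info)  # <k> = 2
```

With these numbers the threshold mean degree is a little below 4, so the structure is accepted for `dense` (negative difference) and rejected for `sparse` (positive difference).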

Partial detectability and parsimony. The condition $\Sigma_b < 0$ is not a statement on the
absolute detectability of a given model, only on the extent to which the extracted information (if
any) can be used to compress the data. Although these are intimately related, the MDL criterion is
based on the idea of perfect (or lossless) compression, and thus corresponds simply to a condition
necessary (but not sufficient) for the perfect recoverability of the model parameters from the
data. Perfect inference, however, is only possible in the asymptotically dense case
$\langle k \rangle \to \infty$ [18], and in practice one always has some amount of uncertainty.
Therefore it remains to be determined how practical the parsimony limit derived from MDL is for
establishing a noise threshold on empirically observed data. Fig. 1 shows an example of a detection
problem given an arbitrarily chosen block structure with $B = 10$ and $\mathcal{I}_t = \ln B/6$.
Fig. 1b shows the minimum of $\Sigma_b/E$ as a function of $B$, for networks with different
$\langle k \rangle$, obtained with the Monte Carlo algorithm described below. If
$\langle k \rangle$ is large enough ($\langle k \rangle > 6$, according to Eq. 7), the minimum of
$\Sigma_b$ is clearly at the correct $B = 10$ value, and as is shown in Fig. 1c this is exactly
where the normalized mutual information (NMI) [35] between the known and inferred partition is the
largest. However, for $\langle k \rangle < 6$ the minimum of $\Sigma_b$ is no longer at $B = 10$,
and instead it is at $B = 1$. Nevertheless, the overlap with the correct partition is overall
positive and is still the largest at $B = 10$, so the correct partition is still to some extent
detectable, but the MDL criterion rejects it. By experimenting with different planted block
structures (see Fig. 1d), one observes that the MDL threshold lies very close to the parameter
region where the inferred partition is no longer well correlated with the true partition. The
comparison with the partial detectability threshold can be made in more detail by considering the
special case known as the planted partition model (PP) [36], which imposes a diagonal block
structure given by $m_{rr} = c/B$, $m_{rs} = (1 - c)/B(B - 1)$ for $r \neq s$, and $w_r = 1/B$,
where $c \in [0, 1]$ is a free parameter. In this case it can be shown that even partial inference
is only possible if $\langle k \rangle > ((B - 1)/(cB - 1))^2$ [16, 17, 37], otherwise no
information at all on the original model can be extracted. For smaller values of $B$, this bound is
lower (i.e. less restrictive) than the one given by Eq. 7 for this model (where we have
$\mathcal{I}_{t/c} = c\ln(Bc) + (1 - c)\ln(B(1 - c)/(B - 1))$), which means that there is a region
of parameters where the MDL criterion will discard potentially useful (albeit clearly noisy)
information (see Fig. 2a). Interestingly, however, for larger values of $B$, the MDL criterion will
most often result in lower bounds (see Fig. 2b), meaning that whatever partial information can be
recovered from the model will not be discarded. For $B \to \infty$ we have
$c_{\mathrm{MDL}} \sim 2/\langle k \rangle$ and $c^* \sim 1/\sqrt{\langle k \rangle}$, and thus
$c_{\mathrm{MDL}} < c^*$ for $\langle k \rangle > 4$. Therefore, as far as the PP model serves as a
good representation of more general block structures, one should not expect excessive parsimony
from MDL, at least as long as the value of $B$ is sufficiently large.

FIG. 2. (a) NMI between the true and inferred partitions for planted partition samples with
$B = 10$ as a function of $c$ for different $\langle k \rangle$. The grey (red) lines correspond to
the threshold $c^*$ of Ref. [17] ($c_{\mathrm{MDL}}$ given by Eq. 7); (b) Difference between
$c_{\mathrm{MDL}}$ and $c^*$, as a function of $\langle k \rangle$ for different $B$.
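The two planted-partition thresholds can be compared numerically. The sketch below makes two assumptions that are not spelled out in the text: $c^*$ is obtained by rearranging $\langle k \rangle > ((B-1)/(cB-1))^2$ on the assortative branch $c > 1/B$, and $c_{\mathrm{MDL}}$ is found by bisection on the large-$E$ condition $\mathcal{I}_t(c) = 2\ln B/\langle k \rangle$. All function names are hypothetical.

```python
import math

def pp_information(c, B):
    """I_t = c ln(Bc) + (1 - c) ln(B(1 - c)/(B - 1)); zero-weight terms contribute 0."""
    total = 0.0
    if c > 0:
        total += c * math.log(B * c)
    if c < 1:
        total += (1 - c) * math.log(B * (1 - c) / (B - 1))
    return total

def c_star(B, k):
    """Partial-detectability threshold, assuming c > 1/B."""
    return (1 + (B - 1) / math.sqrt(k)) / B

def c_mdl(B, k, tol=1e-12):
    """Smallest c in [1/B, 1] with I_t(c) >= 2 ln(B) / k, or None if there is none."""
    target = 2 * math.log(B) / k
    lo, hi = 1 / B, 1.0
    if pp_information(hi, B) < target:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pp_information(mid, B) < target:
            lo = mid
        else:
            hi = mid
    return hi

small_B = (c_mdl(2, 10), c_star(2, 10))          # MDL is the stricter criterion here
large_B = (c_mdl(10**6, 16), c_star(10**6, 16))  # MDL is the more permissive one here
```

For $B = 2$ and $\langle k \rangle = 10$ the MDL threshold lies above $c^*$, while for very large $B$ the ordering is reversed, in line with the limiting behaviour quoted above.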

The largest detectable value of B. The MDL approach imposes an intrinsic constraint on the maximum
value of $B$ which can be detected, $B_{\max}$, given a network size and density. This can be
obtained by minimizing $\Sigma_b$ over all possible block structures with a given $B$, which
amounts simply to replacing $\mathcal{I}_{t/c}$ by its maximum value $\ln B$ in Eq. 6,

$\Sigma'_b = E\left[h\!\left(\frac{B(B+1)}{2E}\right) - \ln B\right] + N\ln B$.   (8)

Eq. 8 is a strictly convex function of $B$. This means there is a global minimum
$\Sigma'_b|_{B=B_{\max}}$ given uniquely by $N$ and $E$. It is easy to see that even if the
prescribed block structure with $B > B_{\max}$ has minimal entropy (i.e.
$\mathcal{I}_{t/c} = \ln B$), alternative partitions with $B' < B$ blocks (obtained by merging
blocks such that $\mathcal{I}_{t/c} = \ln B'$) will necessarily possess a smaller description
length. Imposing $\partial\Sigma'_b/\partial B = 0$, one obtains
$B_{\max} = \mu(\langle k \rangle)\sqrt{E}$, with $\mu(\langle k \rangle)$ being the solution of
$\mu\ln(2/\mu^2 + 1) - (1 - 1/\langle k \rangle)/\mu = 0$ [for the directed case we make
$2/\mu^2 \to 1/\mu^2$ and $1/\langle k \rangle \to 2/\langle k \rangle$] [38]. Therefore, according
to the MDL criterion, the maximum number of blocks which is detectable scales as
$B_{\max} \sim \sqrt{N}$ for a fixed value of $\langle k \rangle$. This is consistent with the
detectability analysis in Ref. [40], which showed by other means that the traditional blockmodel
parameters can only be recovered if the number of blocks does not scale faster than $\sqrt{N}$.
Note that this means that the limit $E \gg B^2$ cannot be taken a priori when inferring from
empirical data, and hence the value of $\mathcal{L}_t$ computed in Ref. [29] needs to be replaced
with Eq. 4 in the general case.
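The square-root scaling can be checked numerically without solving for $\mu$. The sketch below assumes the undirected form $\Sigma'_b = E[h(B(B+1)/2E) - \ln B] + N\ln B$ and simply scans all integer values of $B$; the brute-force scan and the function names are choices made for this illustration, not the paper's procedure.

```python
import math

def h(x):
    return (1 + x) * math.log(1 + x) - x * math.log(x)

def sigma_b_prime(B, N, E):
    """Best-case description length difference for B blocks (I_t/c replaced by ln B)."""
    return E * (h(B * (B + 1) / (2 * E)) - math.log(B)) + N * math.log(B)

def b_max(N, E):
    """Integer B in 1..N that minimizes sigma_b_prime, found by exhaustive scan."""
    return min(range(1, N + 1), key=lambda B: sigma_b_prime(B, N, E))

# Same average degree <k> = 10, network size multiplied by four.
b_small = b_max(1_000, 5_000)
b_large = b_max(4_000, 20_000)
ratio = b_large / b_small
```

Quadrupling $N$ at fixed $\langle k \rangle$ roughly doubles the minimizing $B$, as expected for $B_{\max} \propto \sqrt{E}$.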

The limit $B_{\max} \propto \sqrt{E}$ is very similar to the so-called "resolution limit" of
community detection via modularity optimization [41], which is $B^Q_{\max} = \sqrt{E}$. These two
limits, however, have quite different origins. The value of $B_{\max}$ has an
information-theoretic interpretation, while the value of $B^Q_{\max}$ arises simply from the
definition of modularity. Indeed, this limit can be to some extent alleviated (but not entirely
avoided) by properly modifying the modularity function with scale parameters [42-47]. On the other
hand, the resolution of the MDL approach can be improved if any a priori information is known which
leads to a smaller class of models to be inferred. In general, if we have
$\mathcal{L}_t = E f(B^a/E) + N\ln B$, where $f(x)$ is any function, performing the same analysis
as above leads to $B_{\max} = (\mu(\langle k \rangle) E)^{1/a}$, with
$a f'(\mu)\mu + 2/\langle k \rangle - 1 = 0$. Thus, e.g. if we need only to infer diagonal block
structures (as in the usual meaning of community structure), we would have $a = 1$, and the value
of $B_{\max}$ would scale linearly with $N$. Furthermore, if the existing block structure is
locally dense (i.e. $e_{rs} \sim n_r n_s$), as in the union of $B$ complete graphs considered in
[41], the expressions in Eqs. 1 and 2 are no longer valid, and will overestimate the entropy. Using
the correct entropy (Eqs. 5 and 9 in [32]) will lead to an improved resolution. Unfortunately, for
the dense case, the entropy for the degree-corrected variant cannot be computed in closed form,
even for the case with "soft" degree constraints (Eqs. 12-14 in [32]).

FIG. 3. Top: Value of $\Sigma_b/E$ for both blockmodel variants as a function of $B$ for (a) the
American football network of [48] and (b) the political books network of [49]. Bottom: Inferred
partitions with the smallest description length. Nodes circled in red do not match the known
partitions.

FIG. 4. Left: Inferred block structure for the IMDB network, with $N = 372{,}787$,
$E = 1{,}812{,}657$ and $B = 332$, according to the MDL criterion and the degree-corrected
stochastic blockmodel. Right: Circles correspond to film blocks, and squares to actors. The node
colors correspond to the countries of production. See Supplemental Material for a higher resolution
version, with more detailed information.



Detection algorithm. For a fixed $B$, the best partition can be found by minimizing
$\mathcal{S}_{t/c}$ via well-established methods such as Markov chain Monte Carlo (MCMC), using the
Metropolis-Hastings algorithm [50, 51]. However, a naive implementation based on fully random block
membership moves can be very slow. We found that the performance can be drastically improved by
using local information and current knowledge of the partially inferred block structure, simply by
proposing moves $r \to s$ with a probability $p(r \to s|t) \propto e_{ts} + 1$, where $t$ is the
block label of a randomly chosen neighbor of the node being moved. Each sweep of this algorithm can
be performed in $O(E)$ time, independent of $B$ (see Supplemental Material). Having obtained the
minimum of $\mathcal{S}_{t/c}$, the best value of $B$ is obtained via an independent
one-dimensional minimization of $\Sigma_b$, using a Fibonacci search [52], based on subsequent
bisections of an initial triplet $(B_1 = 1, B_2, B_3 = B_{\max})$ which brackets the minimum. This
method finds a local minimum in $O(\ln B_{\max})$ time. The overall number of steps necessary for
the entire algorithm is $O(\tau E \ln B_{\max})$, where $\tau$ is the average mixing time of the
Markov chain. If we have no prior information on $B_{\max}$, we need to assume
$B_{\max} \sim \sqrt{E}$, in which case the complexity becomes $O(\tau E \ln E)$, or
$O(\tau N \ln N)$ for sparse graphs. This compares favorably to minimization strategies which
require the computation of the full marginal probability $\pi^i_r$ that node $i$ belongs to block
$r$, such as Belief-Propagation (BP) [17, 18, 53], which results in a larger complexity of
$O(NB^2)$ per sweep [or $O(NB^2 k_{\max})$ for the degree-corrected variant, with $k_{\max}$ being
the largest degree [53]], or $O(N^2)$ for $B = B_{\max}$.

Empirical networks. The MDL approach yields convincing results for many empirical networks, as can
be seen in Fig. 3, which shows results for the College Football network of [48] and the Political
Books network of [49]. In both cases the correct number of blocks is inferred, and the best
partition matches reasonably well the known true values, at least for the degree-corrected variant.
Employing the Monte Carlo algorithm above, results may be obtained for much larger networks. We
show in Fig. 4 the block partition obtained with the degree-corrected variant for the IMDB network
of actors and films [39], where a film node is connected to all its cast members. The bipartiteness
of the network is fully reflected in the inferred block partition, where films and actors always
belong to different blocks, although this has not been imposed a priori (something which would be
impossible to obtain with, e.g., modularity optimization). Besides this role separation, the film
blocks are divided sharply along spatial, temporal and genre lines, and the actor blocks are
closely correlated with such film classes (see Supplemental Material for a more detailed analysis).

In summary, we showed how minimizing the full description length of empirical network data enables
simple, efficient, unbiased and fully non-parametric analysis of the large-scale properties of
large networks, for which no a priori information is available, while at the same time providing
general bounds on the perfect decodability of arbitrary block structures from empirical data.

I. BLOCKMODEL ENTROPY



We give here a brief overview of the entropies of the various blockmodel variants referenced in the main text. For
more details refer to Ref. [1]. The entropy of the traditional blockmodel ensemble for the undirected case is

$\mathcal{S}_t = \frac{1}{2}\sum_{rs} n_r n_s H\!\left(\frac{e_{rs}}{n_r n_s}\right)$,   (1)

while for the directed case it reads

$\mathcal{S}_t^d = \sum_{rs} n_r n_s H\!\left(\frac{e_{rs}}{n_r n_s}\right)$,   (2)

where $H(x) = -x\ln x - (1 - x)\ln(1 - x)$ is the binary entropy function. In both cases, $e_{rs}$ is the number of
edges from block $r$ to $s$ (or the number of half-edges for the undirected case when $r = s$), and $n_r$ is the
number of nodes in block $r$. In the sparse limit, $e_{rs} \ll n_r n_s$, these expressions may be written
approximately as

$\mathcal{S}_t \cong E - \frac{1}{2}\sum_{rs} e_{rs}\ln\left(\frac{e_{rs}}{n_r n_s}\right)$,   (3)

$\mathcal{S}_t^d \cong E - \sum_{rs} e_{rs}\ln\left(\frac{e_{rs}}{n_r n_s}\right)$.   (4)



However, since Eqs. 1 and 2 are more general, they should in principle be preferred. On the other hand, Eqs. 3 and 4
have the advantage that they allow the entropy difference $\Delta\mathcal{S}_t$ obtained by changing the block
membership of a single node to be computed more easily. For the undirected case, for instance, we may write
$\mathcal{S}_t = E - \frac{1}{2}\sum_{rs} e_{rs}\ln e_{rs} + \sum_r e_r \ln n_r$, and notice that we need to modify
at most $4k$ terms in the first sum and 2 terms in the second if we change the membership of a node with degree $k$
(this is also true for the degree-corrected variant below). On the other hand, using Eq. 1 we need to modify a number
of terms which is proportional to the number of blocks $B$, which will become costly if it is much larger than
typical values of $k$. Thus the extra precision comes at a performance cost, at least when using a Monte Carlo
algorithm depending on block membership moves. Therefore, if one assumes that the sparsity condition
$e_{rs} \ll n_r n_s$ is likely to be fulfilled, using Eqs. 3 and 4 can be advantageous.
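The trade-off between the two forms can be seen on a toy block matrix. The sketch below compares the binary-entropy expression $\frac{1}{2}\sum_{rs} n_r n_s H(e_{rs}/n_r n_s)$ with its sparse approximation $E - \frac{1}{2}\sum_{rs} e_{rs}\ln(e_{rs}/n_r n_s)$ for the undirected case; the function names and the example numbers are assumptions made for illustration.

```python
import math

def binary_entropy(x):
    """H(x) = -x ln x - (1 - x) ln(1 - x), with H(0) = H(1) = 0."""
    if x <= 0 or x >= 1:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def entropy_exact(e, n):
    """(1/2) sum_rs n_r n_s H(e_rs / (n_r n_s)) for the undirected ensemble."""
    B = len(n)
    return 0.5 * sum(n[r] * n[s] * binary_entropy(e[r][s] / (n[r] * n[s]))
                     for r in range(B) for s in range(B))

def entropy_sparse(e, n):
    """E - (1/2) sum_rs e_rs ln(e_rs / (n_r n_s)), valid for e_rs << n_r n_s."""
    E = sum(map(sum, e)) / 2
    total = sum(e[r][s] * math.log(e[r][s] / (n[r] * n[s]))
                for r in range(len(n)) for s in range(len(n)) if e[r][s] > 0)
    return E - total / 2

# 20 edges among 1000 nodes (sparse) versus 20 edges among 10 nodes (dense).
sparse_pair = (entropy_exact([[40]], [1000]), entropy_sparse([[40]], [1000]))
dense_pair = (entropy_exact([[40]], [10]), entropy_sparse([[40]], [10]))
```

In the sparse example the two values agree closely, while in the dense one the approximation is visibly larger than the binary-entropy value.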
For the degree-corrected variant with "hard" degree constraints the equivalent expressions are 



$\mathcal{S}_c \cong -E - \sum_k N_k \ln k! - \frac{1}{2}\sum_{rs} e_{rs}\ln\left(\frac{e_{rs}}{e_r e_s}\right)$,   (5)

$\mathcal{S}_c^d \cong -E - \sum_{k^+} N_{k^+}\ln k^+! - \sum_{k^-} N_{k^-}\ln k^-! - \sum_{rs} e_{rs}\ln\left(\frac{e_{rs}}{e_r^+ e_s^-}\right)$,   (6)

where $e_r = \sum_s e_{rs}$ is the number of half-edges incident on block $r$, and $e_r^+ = \sum_s e_{rs}$ and
$e_r^- = \sum_s e_{sr}$ are the number of out- and in-edges adjacent to block $r$, respectively. These expressions
are also only valid in the sparse limit, which in this case involves more details of the degree distribution,

$e_{rs}\,\frac{\langle k^2 \rangle_r - \langle k \rangle_r}{\langle k \rangle_r^2}\,\frac{\langle k^2 \rangle_s - \langle k \rangle_s}{\langle k \rangle_s^2} \ll n_r n_s$,   (7)

where $\langle k^l \rangle_r = \sum_{i \in r} k_i^l / n_r$ (for the directed case we simply replace
$\langle k^l \rangle_r \to \langle (k^+)^l \rangle_r$ and $\langle k^l \rangle_s \to \langle (k^-)^l \rangle_s$ in
the equation above). Unfortunately there is no closed-form expression for the entropy outside the sparse limit,
unlike for the traditional variant. One can derive higher-order corrections which become relevant when the dense
limit is approached [1], which will be different for simple graphs and multigraphs, as well as for "hard" or "soft"
degree constraints, but will break down if the deviation from the sparse case is too strong. This could be especially
problematic for networks with a very broad degree distribution without a structural cutoff, for which the values of
$\langle k^2 \rangle$ are large. In such cases Eqs. 5 and 6 can still be used, but they must be understood as
approximations, which may lead to the detection of spurious blocks which simply reflect the intrinsic dissortative
degree-degree correlations [1]. The minimum description length approach presented in the main text may alleviate
this problem, since these spurious blocks may end up being rejected if the dissortativity is not too strong. But more
care should be taken in such cases, since a more satisfying general methodology is still lacking.



II. MONTE CARLO INFERENCE 



As described in the main text, the Markov chain Monte Carlo (MCMC) inference algorithm consists in using the
Metropolis-Hastings algorithm [2, 3], where for each node $i$ one attempts a move $r \to s$, where $r = b_i$ is its
current block membership, and accepts or rejects the move depending on the entropy difference. The main caveat here
is how the proposed values of $s$ are chosen. The simplest approach is to choose randomly between all $B$ options,
which would lead to a correct algorithm, but with a very slow convergence to the steady state distribution for large
values of $B$, since most moves would simply be rejected. Instead, we opt to propose moves which take into account
the partial block structure inferred at the current stage of the algorithm and the local neighbourhood of the node
being moved: we inspect a random neighbour $j$ of the node $i$ being moved, obtain its block label $t = b_j$, and
choose the new value $s$ with probability proportional to $e_{ts} + 1$ (this is not simply $e_{ts}$, in order to
guarantee ergodicity, i.e. all values of $s$ can be chosen with nonzero probability). In other words, we inspect the
typical block neighbourhood of a neighbour $j$ to decide where the node $i$ is more likely to belong. This choice is
particularly appealing since it is possible to sample the value of $s$ with a very simple and efficient algorithm.
All we need is to write the move proposal probability as

$p(r \to s|t) = \frac{e_{ts} + 1}{e_t + B} = (1 - R_t)\frac{e_{ts}}{e_t} + R_t\frac{1}{B}$,   (8)

with $R_t = B/(e_t + B)$. Hence, in order to sample $s$ we proceed as follows:
1. A random neighbour $j$ of the node $i$ being moved is selected, and its block membership $t = b_j$ is obtained;
2. The value $s$ is randomly selected from all $B$ choices with equal probability;
3. With probability $R_t$ it is accepted;
4. If it is rejected, a randomly chosen edge adjacent to block $t$ is chosen, and the block label $s$ is taken from
its opposite endpoint.
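The sampling steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference implementation: `half_edges[t]` is a hypothetical list holding, for every half-edge incident on block `t`, the block label found at the opposite endpoint, and the neighbour block `t` is taken as given.

```python
import random

def propose_block(t, B, half_edges, rng=random):
    """Sample s with probability (e_ts + 1) / (e_t + B).

    t: block label of a randomly chosen neighbour of the node being moved.
    half_edges[t]: opposite-endpoint block labels of all half-edges on block t.
    """
    e_t = len(half_edges[t])
    # With probability R_t = B / (e_t + B), keep a uniformly random block.
    if rng.random() < B / (e_t + B):
        return rng.randrange(B)
    # Otherwise follow a random half-edge of block t to its other endpoint.
    return rng.choice(half_edges[t])

# Block 0 has e_00 = 0, e_01 = 3, e_02 = 1, so p(s) = (1/7, 4/7, 2/7).
rng = random.Random(0)
half_edges = {0: [1, 1, 1, 2]}
samples = 70_000
counts = [0, 0, 0]
for _ in range(samples):
    counts[propose_block(0, 3, half_edges, rng)] += 1
frequencies = [c / samples for c in counts]
```

Folding steps 2 and 3 into a single branch gives the same distribution as drawing a uniform candidate first and then accepting it with probability $R_t$.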

This algorithm is "rejection free" (despite step 3), since it always produces a value of $s$ with the desired
probability after a single execution. It requires that a list of half-edges incident on each block is kept at all
times. These lists can be updated efficiently in time $O(k_i)$ after the move of node $i$, since they do not need to
be ordered, and incur a memory complexity of $O(E)$, so the whole approach is very easy to implement. The value of
$s$ sampled this way still needs to be accepted or rejected depending on how it changes the entropy. In order for the
steady state distribution of the Markov chain to be correct we still need to enforce detailed balance. Hence the
final Metropolis-Hastings acceptance probability $a$ needs to be

$a = \min\left\{ e^{-\beta\Delta\mathcal{S}_{t/c}}\, \frac{\sum_t p^i_t\, p(s \to r|t)}{\sum_t p^i_t\, p(r \to s|t)},\; 1 \right\}$,   (9)

where $p^i_t$ is the fraction of neighbours of node $i$ which belong to block $t$, and $p(s \to r|t)$ is computed
after the proposed $r \to s$ move (i.e. with the new values of $e_{rt}$), whereas $p(r \to s|t)$ is computed before.
As mentioned in the previous section, the change in $\mathcal{S}_{t/c}$ can be computed in $O(k_i)$ time, which is
also the complexity of the remaining terms of $a$. A full Monte Carlo sweep of the network can therefore be performed
in $O(E)$ time.
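A minimal sketch of the acceptance step follows, assuming the standard Metropolis-Hastings form in which the Boltzmann factor $e^{-\beta\Delta\mathcal{S}}$ is multiplied by the ratio of reverse to forward proposal probabilities, each averaged over the neighbour blocks of the node. The dictionaries and function names are hypothetical and only serve this example; a finite `beta` is assumed.

```python
import math

def proposal_prob(e_ts, e_t, B):
    """p(r -> s | t) = (e_ts + 1) / (e_t + B)."""
    return (e_ts + 1) / (e_t + B)

def acceptance(delta_S, beta, nbr_frac, forward, backward):
    """Metropolis-Hastings acceptance probability for a proposed move r -> s.

    delta_S: entropy change caused by the move.
    nbr_frac[t]: fraction of the node's neighbours that belong to block t.
    forward[t]: p(r -> s | t), evaluated before the move.
    backward[t]: p(s -> r | t), evaluated after the move.
    """
    num = sum(p * backward[t] for t, p in nbr_frac.items())
    den = sum(p * forward[t] for t, p in nbr_frac.items())
    # Work in log space so that large |beta * delta_S| cannot overflow exp().
    log_a = -beta * delta_S + math.log(num) - math.log(den)
    return 1.0 if log_a >= 0 else math.exp(log_a)

# Symmetric proposals: the ratio cancels and only the entropy change matters.
a_neutral = acceptance(0.0, 1.0, {0: 1.0}, {0: 0.25}, {0: 0.25})
a_uphill = acceptance(math.log(2), 1.0, {0: 0.5, 1: 0.5},
                      {0: 0.2, 1: 0.4}, {0: 0.2, 1: 0.4})
```

Because every proposal probability is strictly positive, both sums are positive and the logarithms are always defined.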

If one chooses $\beta = 1$, the partitions are sampled with probability proportional to
$e^{-\mathcal{S}_{t/c}}$, which corresponds to the correct posterior probability that the fitted model matches the
data. This can be useful in order to sample the marginal probability $\pi^i_r$ that node $i$ belongs to block $r$,
which gives more detailed information on the network structure [4]. However, if one wants to minimize the description
length, the ground state of $\mathcal{S}_{t/c}$ needs to be obtained via $\beta \to \infty$. This can be done by
changing the value of $\beta$ either slowly (i.e. simulated annealing [5]) or abruptly (i.e. greedy minimization).
The former approach avoids getting trapped in local minima, but can be very slow and requires experimentation with
the cooling schedule, while the latter is more efficient, but does not guarantee that the optimum is found. However,
for sufficiently well-pronounced block structures, both approaches should produce comparable results. In the
following examples we use the latter, simpler approach of abruptly cooling the system once the Markov chain has
equilibrated, but the overall method does not depend on how the limit $\beta \to \infty$ is eventually reached.

The equilibration criterion used was to keep track of the maximum and minimum values of $\Sigma_{t/c}$ and stop
after $T$ successive sweeps in which neither value changed, where $T$ is made sufficiently high so that the results
no longer depend on it. This criterion was applied twice in a row to get over "humps" in the value of
$\Sigma_{t/c}$ when starting from a previously minimized state for a larger value of $B$ (see next section).

A. Minimizing the description length $\Sigma_{t/c}$

Since the minimum value of $\mathcal{S}_{t/c}$ can be obtained independently for every value of $B$ with the above
algorithm, the minimum value of $\Sigma_{t/c}$ (or equivalently $\Sigma_b$) can be obtained via a one-dimensional
minimization on $B$. The most appropriate algorithm in this case is called golden section search (a.k.a. Fibonacci
search) [6]. It consists in at first bracketing the minimum of $\Sigma_b$ by finding a triplet $(B_1, B_2, B_3)$,
with $B_1 < B_2 < B_3$, such that $\Sigma_b|_{B=B_1} > \Sigma_b|_{B=B_2} < \Sigma_b|_{B=B_3}$. This is done easily by
starting with $B_1 = 1$, $B_3 = B_{\max}$ and choosing $B_2 = B_3 - \lfloor B_3 - B_1 \rfloor_F$, where
$\lfloor x \rfloor_F$ is the largest Fibonacci number smaller than $x$. This is repeated until the minimum is
bracketed. After this, the intervals are progressively bisected with
$B'_2 = B'_3 - \lfloor B'_3 - B'_1 \rfloor_F$, with $(B'_1, B'_3)$ being the largest of the intervals $(B_1, B_2)$ or
$(B_2, B_3)$. Depending on the value of $\Sigma_b|_{B=B'_2}$, we choose the new interval as
$(B_1|B_2, B'_2, B_2|B_3)$ or $(B_1|B'_2, B_2, B'_2|B_3)$, so that the new choice brackets the minimum. This
algorithm contributes a factor of $O(\ln B_{\max})$ to the overall complexity. The bisections chosen this way
optimize the worst-case scenario, and guarantee that the global minimum is found as long as the function being
minimized is convex. This need not be the case for $\Sigma_b$ in general, so one can only guarantee that a local
minimum is found. However, in most cases $\Sigma_b$ has an overall convex shape, even if not strictly so near the
minimum, since it is exactly zero for $B = 1$, less than zero for some $B > 1$ (if some block structure is
detectable) and $\Sigma_b \to \infty$ for $B \to \infty$. If more precision is desired, one can perform the search
for different initial ranges on $B$. Furthermore, since it is based on a minimization of $\mathcal{S}_{t/c}$ via
Monte Carlo, it is often useful to perform multiple independent runs of the algorithm, and choose the best outcome.
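The idea of the one-dimensional search can be sketched with a simplified stand-in: a golden-section-style search over integers that shrinks a bracketing interval by roughly the golden ratio at each step. It does not reproduce the exact Fibonacci bookkeeping described above, and it assumes that the objective is strictly unimodal on the search range; `minimize_over_B` is a hypothetical name.

```python
def minimize_over_B(f, lo, hi):
    """Return the integer in [lo, hi] minimizing a strictly unimodal f.

    Each evaluation of f stands for a full (expensive) entropy minimization
    at fixed B, so results are cached and never recomputed.
    """
    cache = {}

    def g(B):
        if B not in cache:
            cache[B] = f(B)
        return cache[B]

    while hi - lo > 2:
        # Probe two interior points placed about 38% in from each end.
        step = max(1, int(0.382 * (hi - lo)))
        x1, x2 = lo + step, hi - step
        if g(x1) <= g(x2):
            hi = x2  # the minimum cannot lie to the right of x2
        else:
            lo = x1  # the minimum cannot lie to the left of x1
    return min(range(lo, hi + 1), key=g)

evaluated = []

def toy_objective(B):
    evaluated.append(B)
    return (B - 37) ** 2

best_B = minimize_over_B(toy_objective, 1, 1000)
```

The number of objective evaluations grows only logarithmically with the width of the initial range, which is the property that makes the outer search cheap compared to the Monte Carlo sweeps.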

A crucial part of the algorithm involves obtaining the minimum partitions for the different values of $B$
encountered during the bisections. A naive application of the MC sweeps described above starting from a random
partition would discard all the work done for the previous values of $B$. Instead, whenever minimizing $\Sigma_b$ for
a given value of $B$, we start with the previously obtained solution for the smallest $B' > B$, and treat nodes
belonging to the same blocks as single nodes, weighted according to $n_r$, with the edges between them weighted
according to $e_{rs}$, such that the value of $\mathcal{S}_{t/c}$ is the same. The MCMC algorithm is performed for
this graph until convergence, and afterwards the process is repeated with the original graph, so that nodes can be
moved individually. This multilevel step can significantly decrease the mixing time of the Markov chain at later
steps.

The overall complexity of the entire algorithm is therefore $O(\tau E \ln B_{\max})$, where $\tau$ is the average
mixing time of the Markov chain. If we have no prior information on $B_{\max}$, we need to assume
$B_{\max} \sim \sqrt{E}$, in which case the complexity becomes $O(\tau E \ln E)$, or equivalently
$O(\tau N \ln N)$ for sparse graphs with $N \sim E$.

An efficient C++ implementation of this algorithm is freely available as part of the graph-tool Python library at
http://graph-tool.skewed.de.

III. THE INTERNET MOVIE DATABASE (IMDB) NETWORK 

The IMDB network was constructed by considering all available records in the IMDB database, available at
http://www.imdb.com/interfaces. It contains comprehensive information on films, TV shows and video games, and on their
cast of actors, as well as producers, directors, etc. Here we considered only the bipartite film-actor network, where
a node represents either a film or an actor, and a given film is linked to its cast members. A 'film' designates any
entry in the IMDB database, which can correspond to a theatrical release as well as straight-to-video releases, TV
shows and even video games, and 'actor' designates any cast member. Extensive information is available for the
films, including year of production, country of production, genres, and user-supplied keywords. We collected all
available information ca. October 2012 into a network representing the full database. However, for many entries there
is little to no additional information available beyond the bare essentials, such as the title. Since we intend to interpret
the overall large-scale structure, we pruned the database so that only entries with all of the mentioned metadata are
included. Furthermore, we removed actors who appear in only one film, and films with only one actor, since
these entries only burden the analysis without providing significant information on the overall network structure
(this pruning was applied recursively, so what remained is the 2-core [7] of the network). The resulting network
has $N = 372,787$ nodes (275,805 actors and 96,982 films) and $E = 1,812,657$ edges (the average degree is hence
$\langle k \rangle \approx 9.72$; the actors appear on average in $\langle k \rangle_a \approx 6.57$ films, and the films have on average $\langle k \rangle_m \approx 18.7$ actors).
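
The recursive pruning and the quoted average degrees can be reproduced with a short sketch. It assumes a simple graph given as an edge list (no self-loops or parallel edges, as in a film-actor network), and the function names `two_core` and `average_degrees` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict


def two_core(edges):
    """Recursively remove nodes of degree < 2 and return the remaining edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    stack = [v for v in adj if len(adj[v]) < 2]
    removed = set()
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for w in list(adj[v]):
            adj[w].discard(v)
            # A neighbor may drop below degree 2 and must then go as well.
            if w not in removed and len(adj[w]) < 2:
                stack.append(w)
        adj[v].clear()
    return [(u, v) for u, v in edges if u not in removed and v not in removed]


def average_degrees(n_actors, n_films, n_edges):
    """Return (overall, per-actor, per-film) average degrees of a bipartite graph."""
    n = n_actors + n_films
    return 2 * n_edges / n, n_edges / n_actors, n_edges / n_films
```

With the figures quoted above, `average_degrees(275805, 96982, 1812657)` gives values that round to 9.72, 6.57 and 18.7, respectively.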
We applied multiple runs of the algorithm above to this network, and the partition with the minimum value of
$\Sigma_c$ was obtained for $B = 332$. A typical run of the algorithm can be seen in Fig. 1. The obtained block structure can be
seen in Fig. 2. The most obvious feature is that the block structure obtained is also bipartite, i.e. any given block is
composed either only of actors or only of films. This feature arises out of the data itself, and is not imposed a priori. It does,
however, often happen that the best partition contains a small minority of 2 or 3 blocks which contain both films and
actors. This is due to films and actors which have a very small degree, for which the bipartiteness cannot be easily
detected. Nevertheless, the best partitions found (i.e. those with the smallest value of $\Sigma_c$) corresponded to a fully bipartite
structure.
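
Whether a given partition is fully bipartite in this sense is straightforward to check. The following sketch lists the blocks that mix node types. The function name `mixed_blocks` and the input layout (one block label and one type label per node) are assumptions made for illustration.

```python
from collections import defaultdict


def mixed_blocks(b, kind):
    """Return the sorted labels of blocks containing more than one node type.

    b:    list with b[v] = block label of node v
    kind: list with kind[v] = node type of node v, e.g. "film" or "actor"
    """
    kinds = defaultdict(set)
    for v, r in enumerate(b):
        kinds[r].add(kind[v])
    return sorted(r for r, ks in kinds.items() if len(ks) > 1)
```

An empty result corresponds to a fully bipartite block structure.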

Note that this type of fully dissortative block structure cannot be obtained with the more frequently used community 
detection methods, such as modularity optimization and many others, since they focus solely on the opposite case, 
where blocks form assortative connections. 
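
A small worked example illustrates this point. The sketch below evaluates the standard Newman-Girvan modularity, $Q = \sum_r [m_r/m - (d_r/2m)^2]$, where $m_r$ is the number of edges inside block $r$, $d_r$ the sum of degrees in block $r$, and $m$ the total number of edges. The function name `modularity` and the input layout are assumptions made for illustration.

```python
def modularity(edges, b):
    """Newman-Girvan modularity of partition b for an undirected edge list."""
    m = len(edges)
    internal = {}  # number of edges inside each block
    degree = {}    # sum of node degrees in each block
    for u, v in edges:
        for x in (u, v):
            degree[b[x]] = degree.get(b[x], 0) + 1
        if b[u] == b[v]:
            internal[b[u]] = internal.get(b[u], 0) + 1
    return sum(internal.get(r, 0) / m - (d / (2 * m)) ** 2
               for r, d in degree.items())
```

For the complete bipartite graph on two plus two nodes, with edges `[(0, 2), (0, 3), (1, 2), (1, 3)]`, the bipartite partition `[0, 0, 1, 1]` has modularity $-0.5$, whereas placing all nodes in a single block gives $0$. Maximizing modularity therefore never selects the bipartite division.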

More insight into the uncovered structure can be obtained by inspecting Fig. 3, which shows a graphical representation
of the block structure, and Table I, which shows the metadata that most often appears in each block. Without
resorting to a more detailed correlation analysis, certain patterns are clearly recognizable. As mentioned in the main
text, the film blocks are divided clearly according to the year and country of production, as well as genre. In Fig. 3
the layout is such that a rough time arrow pointing from bottom to top emerges. Films made in the USA take up a sizable
portion of the graph, and are roughly divided into two main groups: films made in the 20s-60s (a.k.a. the "Golden
Age of Hollywood") and the more contemporary films from the 70s onwards. In parallel, one can distinguish films
produced in European countries, such as the UK, France, Italy and Germany, following a similar timeline. Geographical
and cultural similarity is also easily recognizable in Fig. 3, such as the proximity between British and American films,
as well as Canadian and American ones, etc. In addition, films seem to be grouped further into genre classes.
Although there is an abundance of seemingly nondescript "Drama" films, categories such as Animation, Western,
Documentary, Action, Sport, Music and Adult films are clearly separated.

In turn, the actors are grouped into blocks
which seem strongly correlated with the film classes, but there are many actor blocks which are strongly connected to
more than one film class. Looking more closely at specific well-known actors reveals intuitive patterns. As would be
expected, actors who worked mostly together are confined to the same block, such as the four "The Three Stooges"
actors (Moe Howard, Larry Fine, Curly Howard and Shemp Howard), who are all confined to block 293, with most of
their films in block 85. The same also holds for famous duos and groups such as Stan Laurel and Oliver Hardy (block 199),
William Abbott and Lou Costello (block 231), Bud Spencer and Terence Hill (block 253), the Marx brothers
Chico, Harpo, Groucho and Zeppo Marx (block 231), the Beatles John Lennon, Paul McCartney, Ringo
Starr and George Harrison (block 274), and so on. Actors who have not worked together systematically, but made
similar types of films or TV shows in the same period of time, also tend to be grouped together, such as the contemporary
comedians Mike Myers, Will Ferrell, Adam Sandler, Ben Stiller and Rob Schneider (block 243), and the martial arts actors
Bruce Lee and Jackie Chan (block 230). However, the strongest factor separating the actors seems to be geographical
location and period of activity, rather than any other professional pattern. For instance, although all Monty Python
members (John Cleese, Eric Idle, Terry Gilliam, Graham Chapman, Michael Palin and Terry Jones) belong to the
same block (314), they are accompanied by $\lesssim 1500$ other actors, including Sean Connery, Roger Moore and Alec
Guinness, i.e. British actors active mostly during the 70s and 80s.
FIG. 1. Left: Typical run of the minimization algorithm for the IMDB network, for $T = 1000$. The inset shows the respective
values of $B$. Right: Zoom into a specific region of the left figure. The first abrupt transition corresponds to the $\beta \to \infty$ cooling
(the inset shows a further zoom into this region). The other two abrupt transitions, for a smaller value of $B$, correspond to the
switch from the multilevel step and then, finally, the $\beta \to \infty$ cooling.







FIG. 2. Inferred block structure for the IMDB network for $B = 332$, which minimizes $\Sigma_c$, with a clear bipartite structure. The
blocks in the range [0, 164] are composed only of films, and the remaining 167 blocks only of actors. Top: The inferred $e_{rs}$
matrix. Middle: Block sizes $n_r$. Bottom: Average block degree $\langle k \rangle_r = e_r / n_r$.
information on each block is given in Table I.






Index (Fig. 2 order) Year Countries [Genres]

0 (140) 1999 Argentina [Drama, Comedy]
1 (117) 1984 Australia [Drama, Comedy]
2 (8) 2003 Australia, USA [Drama, Comedy]
3 (96) 2001 Austria, Germany, Switzerland [Drama, Comedy]
4 (161) 1998 Belgium, France, Netherlands [Drama, Comedy]
5 (131) 1982 Brazil [Drama, Comedy]
6 (134) 1994 Brazil [Drama, Comedy]
7 (14) 1982 Canada, USA [Drama, Comedy]
8 (97) 1995 Canada [Drama, Comedy]
9 (154) 2001 Canada, USA [Drama, Thriller]
10 (54) 2002 Canada, USA [Drama, Comedy]
11 (51) 2005 Canada, USA [Drama, Comedy]
12 (120) 2005 Canada, USA [Drama, Thriller]
13 (81) 1994 Czech Republic, Czechoslovakia [Drama, Comedy]
14 (108) 1966 Denmark [Comedy, Family]
15 (45) 2002 Denmark [Drama, Comedy]
16 (118) 1949 Finland [Comedy, Drama]
17 (113) 1984 Finland [Drama, Comedy]
18 (112) 2000 Finland [Drama, Comedy]
19 (127) 1944 France [Drama, Comedy]
20 (165) 1963 France, Italy [Drama, Comedy]
21 (164) 1977 France, Italy [Drama, Comedy]
22 (12) 1988 France, UK [Adult, Horror]
23 (160) 1991 France [Drama, Comedy]
24 (162) 2000 …
25 (163) 2004 …
26 (115) 1935 …
27 (42) 1993 …
28 (…) 2000 …
29 (…) 2001 …
30 (…) 2005 …
31 (…) 1994 …
32 (…) 1977 …
33 (…) 1999 …
34 (…) 2000 …
35 (…) 1981 …
36 (…) 2002 …
37 (…) 2005 …
38 (…) 2000 …
39 (…) 1956 …
40 (…) 1969 …
41 (87) 1972 …
42 (85) 1979 …
43 (84) 1993 …
44 (83) 2003 …
45 (94) 1961 …
46 (92) 1981 …
47 (158) 1994 …
48 (159) 2000 …
49 (156) 2003 …
50 (128) 2004 …
51 (107) 1961 Mexico [Drama, Comedy]
52 (148) 1996 Mexico, USA [Drama, Comedy]
53 (19) 1997 Netherlands [Drama, Comedy]
54 (146) 2000 New Zealand, USA, Canada [Drama, Comedy]
55 (155) 1997 Norway [Drama, Comedy]
56 (104) 1999 Philippines [Drama, Comedy]
57 (147) 2000 Poland [Drama, Comedy]
58 (103) 1994 Portugal, France [Drama, Comedy]
59 (18) 2001 Romania, USA [Drama, Horror]
60 (50) 1996 Russia, Soviet Union [Drama, Comedy]
61 (13) 1999 South Korea, Thailand, USA [Drama, Comedy]
62 (132) 1966 Spain, Italy [Drama, Comedy]
63 (133) 1987 …
64 (91) 2003 …
65 (136) 1962 …
66 (110) 1995 …
67 (157) 1967 …
68 (105) 2003 …
69 (135) 1938 …
70 (139) 1954 …
71 (138) 1964 …
72 (137) 1972 …
73 (26) 1986 …
74 (25) 1997 …
75 (27) 2000 …
76 (28) 2002 …
77 (29) 2004 …
78 (20) 2006 …
79 (40) 1925 …
80 (114) 1927 …
81 (39) 1934 …
82 (38) 1938 …
83 (35) 1939 …
84 (36) 1940 …
85 (41) 1942 …
86 (31) 1944 …
87 (30) 1948 …
88 (37) 1949 …
89 (76) 1951 …
90 (33) 1952 …
91 (34) 1956 …
92 (32) 1962 …
93 (77) 1964 …
94 (101) 1966 …
95 (80) 1970 …
96 (46) 1971 …
97 (143) 1973 …
98 (79) 1975 …
99 (145) 1976 …
100 (144) 1981 …
101 (130) 1983 USA [Adult, Comedy]
102 (93) 1983 USA, Iran, Argentina [Drama, Comedy]
103 (22) 1985 USA [Drama, Comedy]
104 (90) 1988 USA, Philippines [Action, Drama]
105 (99) 1988 USA [Drama, Action]
106 (2) 1988 USA [Drama, Comedy]
107 (3) 1988 USA [Drama, Thriller]
108 (1) 1989 USA [Drama, Comedy]
109 (24) 1990 USA, UK [Documentary, Drama]

[An ellipsis marks a field that is not recoverable from the source; entries 110-164 are not recoverable.]

TABLE I. Aggregated metadata of the film blocks of Figs. 2 and 3. Each entry gives a unique index, as labeled in Fig. 3;
the ordering used in Fig. 2, in parentheses; the average year of production (rounded to the nearest integer); the countries of
production, ordered according to frequency (countries which appear in < 10% of the films were omitted); and the two most
frequent genres in the block.
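
The aggregation rule stated in the caption can be sketched as follows. The record layout (one dictionary per film with `year`, `countries` and `genres` fields) and the function name `summarize_block` are assumptions made for illustration, since the format of the underlying records is not specified here. Note also that Python's `round` resolves exact halves to the nearest even integer, which may differ from the rounding used for the table.

```python
from collections import Counter


def summarize_block(films, min_country_share=0.10, n_genres=2):
    """Aggregate the metadata of one film block.

    films: non-empty list of dicts with keys "year" (int),
           "countries" (list of str) and "genres" (list of str).
    Returns (mean_year, countries, genres): the rounded mean year, the
    countries appearing in at least min_country_share of the films
    (most frequent first), and the n_genres most frequent genres.
    """
    n = len(films)
    mean_year = round(sum(f["year"] for f in films) / n)
    # Count each country or genre at most once per film.
    countries = Counter(c for f in films for c in set(f["countries"]))
    genres = Counter(g for f in films for g in set(f["genres"]))
    kept = [c for c, k in countries.most_common() if k / n >= min_country_share]
    top = [g for g, _ in genres.most_common(n_genres)]
    return mean_year, kept, top
```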